Job Management Guide (4)

By Hongyu Xiao

Contact: hongyu.xiao@ou.edu

Using SLURM for Efficient Computing

While Jupyter notebook access through tunneling is available, using SLURM for job management often provides better efficiency and resource utilization. Here is a basic SLURM script template to start from (the partition name, resource values, and script name below are placeholders to adapt to your cluster and workload):
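
#!/bin/bash
#SBATCH --partition=normal            # CPU partition (name is cluster-specific)
#SBATCH --ntasks=1                    # Number of tasks
#SBATCH --cpus-per-task=4             # CPU cores per task
#SBATCH --mem=16G                     # Memory request
#SBATCH --time=04:00:00               # Time limit
#SBATCH --output=job_%J_.txt          # Output file
#SBATCH --error=job_%J_.txt           # Error file

# Run your script
python your_script.py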

Here's an example of a GPU-enabled SLURM script for deep learning tasks:

#!/bin/bash
#SBATCH --partition=disc_dual_a100    # GPU partition
#SBATCH --gres=gpu:1                  # Request 1 GPU
#SBATCH --output=job_%J_.txt          # Output file
#SBATCH --error=job_%J_.txt           # Error file
#SBATCH --ntasks=1                    # Number of tasks
#SBATCH --mem=100G                    # Memory request
#SBATCH --time=24:00:00               # Time limit

# Run your deep learning script
python your_training_script.py

When using GPUs, specify the appropriate partition (disc_dual_a100 in this example) and request them with the --gres flag. This ensures your job is scheduled on a node with available GPUs.
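
As a quick sanity check inside the job script (assuming the cluster's GPU plugin exports CUDA_VISIBLE_DEVICES, which is typical), you can print the allocated devices before the training command:

# Show which GPUs SLURM assigned to this job
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Show GPU details (if nvidia-smi is available on the node)
nvidia-smi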

To submit your SLURM job, use:

sbatch your_script.sbatch
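
If the submission succeeds, sbatch prints the assigned job ID, which the monitoring and dependency commands below refer to:

Submitted batch job 123456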

This approach allows for better resource management and more efficient execution of computational tasks compared to interactive notebook sessions.

Common SLURM commands for job management:
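
A few standard utilities cover most day-to-day needs (JobID below is a placeholder for the numeric ID that sbatch reports):

# Show the job queue for all users
squeue

# Cancel one of your jobs
scancel JobID

# Summarize partitions and node availability
sinfo

# Show accounting records for your recent jobs
sacct -u $USER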

Here are examples of using squeue and grep to monitor jobs:

# View all jobs in the queue
$ squeue
 JOBID PARTITION      NAME     USER ST     TIME  NODES NODELIST(REASON)
123456 disc_dual python_tr  hongyux  R  2:30:15      1 node001
123457 disc_dual tensor_jo    user2  R 12:45:22      1 node002
123458 disc_dual pytorch_t    user3 PD  0:00:00      1 (Resources)
123459 disc_a100  train_ml    user4  R  5:12:33      1 node003

# Filter jobs on disc partitions
$ squeue | grep disc
123456 disc_dual python_tr  hongyux  R  2:30:15      1 node001
123457 disc_dual tensor_jo    user2  R 12:45:22      1 node002
123458 disc_dual pytorch_t    user3 PD  0:00:00      1 (Resources)
123459 disc_a100  train_ml    user4  R  5:12:33      1 node003

The output shows job ID, partition name, job name, user, status (R=running, PD=pending), runtime, number of nodes, and node assignment or reason for pending.
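
In addition to grep, squeue has built-in filter options (-u, -p, and --states) that avoid parsing the full listing:

# Show only your own jobs
squeue -u $USER

# Show only jobs in a specific partition
squeue -p disc_dual_a100

# Show only pending jobs
squeue --states=PENDING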

Advanced SLURM Usage Tips

Here are some additional SLURM commands and features that can help you manage your computational jobs more effectively:

1. Job Dependencies

You can make jobs wait for other jobs to complete:

# Start only after job 123456 completes successfully
sbatch --dependency=afterok:123456 next_job.sbatch

# Start only if job 123456 fails
sbatch --dependency=afternotok:123456 cleanup_job.sbatch
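
Dependencies are easier to script if you capture the job ID at submission time; sbatch's --parsable option prints just that ID (the .sbatch file names here are placeholders):

# Submit the first job and keep its ID
jobid=$(sbatch --parsable preprocess.sbatch)

# Start the second job only if the first one succeeds
sbatch --dependency=afterok:${jobid} train.sbatch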

2. Resource Monitoring

Monitor your job's resource usage:

# View detailed job information
sacct -j JobID --format=JobID,JobName,MaxRSS,Elapsed

# Monitor CPU and memory usage of a running job
sstat --format=AveCPU,AveRSS,AveVMSize --jobs JobID
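
Note that sstat only works while the job is running, and for a plain batch script (one that doesn't launch srun steps) the usage is usually recorded under the .batch step, so it can help to query that step explicitly (JobID is again a placeholder):

# Query the batch step of a running job directly
sstat --format=AveCPU,AveRSS,AveVMSize --jobs JobID.batch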